GVPT Maths Boot Camp

Data Wrangling

Learning objectives

  1. Import your data

  2. Clean your data

  3. Explore relational data

Data wrangling

Source: R4DS

World Bank GDP data

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

  • Official source of global and historical GDP data

  • A very common control variable for IR and CP analysis

  • Very frustratingly messy!

Download global GDP data

Source: World Bank

Working with external data

Introducing here::here()

install.packages("here")

Points to where you are on your computer. Updates for everyone on any computer!

here::here()
[1] "/Users/harrietgoers/Documents/intro_to_r_ps"

Reading in your csv

library(tidyverse)

gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"))

head(gdp_raw)
# A tibble: 6 × 3
  `Data Source`               `World Development Indicators` ...3               
  <chr>                       <chr>                          <chr>              
1 Last Updated Date           2023-03-01                      <NA>              
2 Country Name                Country Code                   "Indicator Name,In…
3 Aruba                       ABW                            "GDP (current US$)…
4 Africa Eastern and Southern AFE                            "GDP (current US$)…
5 Afghanistan                 AFG                            "GDP (current US$)…
6 Africa Western and Central  AFW                            "GDP (current US$)…


EXERCISES

  1. Read the ?read_csv help file. What arguments does this function take?

  2. Head to the readr package documentation and find what other file types you can read in.

Skipping non-relevant rows

Source: World Bank

Skipping non-relevant rows

gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"), skip = 4, col_select = 1:66)

head(gdp_raw)
# A tibble: 6 × 66
  Country…¹ Count…² Indic…³ Indic…⁴   `1960`   `1961`   `1962`   `1963`   `1964`
  <chr>     <chr>   <chr>   <chr>      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Aruba     ABW     GDP (c… NY.GDP… NA       NA       NA       NA       NA      
2 Africa E… AFE     GDP (c… NY.GDP…  2.13e10  2.18e10  2.37e10  2.82e10  2.61e10
3 Afghanis… AFG     GDP (c… NY.GDP…  5.38e 8  5.49e 8  5.47e 8  7.51e 8  8.00e 8
4 Africa W… AFW     GDP (c… NY.GDP…  1.04e10  1.11e10  1.19e10  1.27e10  1.38e10
5 Angola    AGO     GDP (c… NY.GDP… NA       NA       NA       NA       NA      
6 Albania   ALB     GDP (c… NY.GDP… NA       NA       NA       NA       NA      
# … with 57 more variables: `1965` <dbl>, `1966` <dbl>, `1967` <dbl>,
#   `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, …

Take a look at your data

Taking a look at this data set:

skimr::skim(gdp_raw)


What types of data do we have? Are they the right type of data?


Are we missing data points?

Tidy Data Structures

Source: R4DS

Tidying World Bank data

What do you want to do with your data?

  • I want to analyse country, regional, and global trends in GDP over time

What I need:

  • Annual data on each country’s GDP

  • The region to which each country belongs

Tidying World Bank data

To do:

  • Move the yearly data from columns to rows

  • Clean up these column names so that they are easier to use in R

  • Add regional data

Pivoting your data

gdp_df <- pivot_longer(
  data = gdp_raw, 
  cols = `1960`:`2021`,
  names_to = "year",
  values_to = "gdp"
)


colnames(gdp_raw)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"          
colnames(gdp_df)
[1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
[5] "year"           "gdp"           

Clean column names

Column names should not:

  • Have spaces

  • Start with numbers

Introducing janitor:

# Install the `janitor` package

install.packages("janitor")
gdp_df <- janitor::clean_names(gdp_df)

colnames(gdp_df)
[1] "country_name"   "country_code"   "indicator_name" "indicator_code"
[5] "year"           "gdp"           

Add region data

Introducing the countrycode package - the indispensable workhorse of country data:

# Install the `countrycode` package

install.packages("countrycode")


library(countrycode)

# Add each country's World Bank region to the data set

gdp_df <- gdp_df |> 
  mutate(region = countrycode(country_name, "country.name", "region", custom_match = c("Turkiye" = "Europe & Central Asia"))) |> 
  # Remove observations that are regions, eg. "East Asia & Pacific" or "World"
  drop_na(region) |> 
  relocate(region, .after = "country_code")

Add region data

gdp_df |> 
  distinct(country_name, region) |> 
  head(10)
# A tibble: 10 × 2
   country_name         region                    
   <chr>                <chr>                     
 1 Aruba                Latin America & Caribbean 
 2 Afghanistan          South Asia                
 3 Angola               Sub-Saharan Africa        
 4 Albania              Europe & Central Asia     
 5 Andorra              Europe & Central Asia     
 6 United Arab Emirates Middle East & North Africa
 7 Argentina            Latin America & Caribbean 
 8 Armenia              Europe & Central Asia     
 9 American Samoa       East Asia & Pacific       
10 Antigua and Barbuda  Latin America & Caribbean 

Make sure all data are the right type

glimpse(gdp_df)
Rows: 13,454
Columns: 7
$ country_name   <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code   <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region         <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year           <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Make sure all data are the right type

gdp_df <- transmute(
  gdp_df,
  country_name, 
  region, 
  year = as.integer(year),
  gdp
)

glimpse(gdp_df)
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region       <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year         <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

How do countries compare over time?

ggplot(gdp_df, aes(
  x = year, y = gdp, colour = region, group = country_name
)) + 
  geom_line() + 
  theme_minimal()

How do regions compare over time?

By default, R will carry forward NAs. This is good!

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

Summary

Today you:

  • Learnt how to read in external data

  • Learnt how to clean common data problems using R